Again, first of all, let's read some data

Read some wind data



In [ ]:

    
# first, the imports
import os
import datetime as dt

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display

np.random.seed(19760812)
%matplotlib inline



In [ ]:

    
# We read the data in the file 'mast.txt'
ipath = os.path.join('Datos', 'mast.txt')

def dateparse(date, time):
    YY = 2000 + int(date[:2])
    MM = int(date[2:4])
    DD = int(date[4:])
    hh = int(time[:2])
    mm = int(time[2:])
    
    return dt.datetime(YY, MM, DD, hh, mm, 0)
    

cols = ['Date', 'time', 'wspd', 'wspd_max', 'wdir',
        'x1', 'x2', 'x3', 'x4', 'x5', 
        'wspd_std']
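# The separator is one or more whitespace characters (raw string, '+' not '*').
# Note: nested `parse_dates` lists and `date_parser` are deprecated in pandas 2.x.
wind = pd.read_csv(ipath, sep = r"\s+", names = cols,
                   parse_dates = [[0, 1]], index_col = 0,
                   date_parser = dateparse)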
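
If `mast.txt` is not at hand, or the pandas version no longer accepts `date_parser`, the same date handling can be sketched in a vectorized way. This is only an illustration: the two rows below are made up, and the `YYMMDD hhmm value` layout is an assumption taken from the `dateparse` function above.

```python
import io
import pandas as pd

# Two made-up rows in the assumed 'YYMMDD hhmm value' layout (not real data)
sample = io.StringIO("130904 0000 2.21\n130904 0010 1.99\n")

# Read date and time as strings so that leading zeros are preserved
raw = pd.read_csv(sample, sep=r"\s+", names=["Date", "time", "wspd"],
                  dtype={"Date": str, "time": str})

# Build the datetime index in one vectorized call instead of row by row
raw.index = pd.to_datetime(raw["Date"] + raw["time"], format="%y%m%d%H%M")
demo = raw.drop(columns=["Date", "time"])
print(demo)
```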

Selecting data

Access the elements as if it were a numpy array

We can access the elements by indexing, as we would with a numpy array or a plain Python sequence:

In Python, indexing starts at 0 and the last element of a slice is not included.



In [ ]:

    
wind[0:10]

Indexes in a numpy array can only be integers.
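
As a small sketch of the half-open slicing rule, the following compares a Python list, a numpy array and a DataFrame. The names `values`, `arr` and `demo` are made up for the illustration and are not part of the wind data.

```python
import numpy as np
import pandas as pd

values = [10, 11, 12, 13, 14]          # made-up numbers
arr = np.array(values)
demo = pd.DataFrame({"wspd": values})

# In all three cases positions 0, 1 and 2 are returned; position 3 is excluded
print(values[0:3])                  # [10, 11, 12]
print(arr[0:3].tolist())            # [10, 11, 12]
print(demo[0:3]["wspd"].tolist())   # [10, 11, 12]
```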

Access the elements by indexing with the labels of the index

Also, unlike numpy, we can index with labels that are not integers:



In [ ]:

    
wind['2013-09-04 00:00:00':'2013-09-04 01:30:00']

In this second example, the indexing uses strings that represent the index values (labels). Note that, in this case, the last element of the slice IS INCLUDED.
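
The difference between the two kinds of slices can be checked on a small synthetic frame. The 10-minute index and the `demo` name are assumptions made for this sketch only.

```python
import numpy as np
import pandas as pd

# Synthetic 10-minute time series (not the real wind data)
idx = pd.date_range("2013-09-04 00:00", periods=12, freq="10min")
demo = pd.DataFrame({"wspd": np.arange(12.0)}, index=idx)

by_label = demo["2013-09-04 00:00:00":"2013-09-04 00:30:00"]
by_position = demo[0:3]

print(len(by_label))      # 4 rows: 00:00, 00:10, 00:20 and 00:30 (end included)
print(len(by_position))   # 3 rows: positions 0, 1 and 2 (end excluded)
```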

Basic selection of a column (`DataFrame`)

In previous examples we have also seen that we can select a column using its name:



In [ ]:

    
wind['wspd'].head(3)

Depending on how the column names are defined, we can also access the column values using dot notation. This does not always work, so I strongly recommend not using it:



In [ ]:

    
# This is similar to what we did in the previous code cell
wind.wspd.head(3)



In [ ]:

    
# An example that can raise an error
df1 = pd.DataFrame(np.random.randn(5,2), columns = [1, 2])
df1



In [ ]:

    
# This will be wrong (a SyntaxError, since 1 is not a valid attribute name)
df1.1



In [ ]:

    
# Instead, we have to use bracket notation
df1[1]
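
Numeric names are not the only problem for dot notation. A minimal sketch with a hypothetical frame `df2`, whose column names were chosen to clash with a DataFrame method and to contain a space:

```python
import pandas as pd

# Hypothetical columns: 'count' clashes with DataFrame.count,
# and 'wind speed' is not a valid Python identifier
df2 = pd.DataFrame({"count": [3, 5], "wind speed": [7.1, 8.4]})

print(callable(df2.count))          # True: the attribute is the method, not the column
print(df2["count"].tolist())        # [3, 5]
print(df2["wind speed"].tolist())   # [7.1, 8.4]
```

Bracket notation works in every one of these cases, which is why it is the safer habit.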

Fancy indexing with `Series`

You can also use fancy indexing with a Series, that is, indexing with a list or a boolean array:



In [ ]:

    
# Create a Series
wspd = wind['wspd']

# Access the elements located at positions 0, 100 and 1000
print(wspd[[0, 100, 1000]])
print('\n' * 3)

# Using indexes at locations 0, 100 and 1000
idx = wspd[[0, 100, 1000]].index
print(idx)
print('\n' * 3)

# We access the same elements as before, but using the labels instead of
# the positions of the elements
print(wspd[idx])
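
The same position versus label distinction can be made explicit with `.iloc` and `.loc`, which are covered later in this document. The sketch below uses a short synthetic Series `s`, an assumption for the example, rather than the wind data.

```python
import numpy as np
import pandas as pd

# Synthetic Series with a 10-minute datetime index
idx = pd.date_range("2013-09-04", periods=5, freq="10min")
s = pd.Series(np.arange(5.0) * 2, index=idx)

by_pos = s.iloc[[0, 2, 4]]       # select by integer position
by_lab = s.loc[by_pos.index]     # select the same elements by label

print(by_pos.tolist())           # [0.0, 4.0, 8.0]
print(by_lab.tolist())           # [0.0, 4.0, 8.0]
```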

With DataFrames, fancy indexing can be ambiguous and it will raise an error (an IndexError or a KeyError, depending on the pandas version).



In [ ]:

    
# Try it...

Boolean indexing

Like with numpy, we can access values using boolean indexing:



In [ ]:

    
idx = wind['wspd'] > 35
wind[idx]

We can use several conditions. For instance, let's refine the previous result:



In [ ]:

    
idx = (wind['wspd'] > 35) & (wind['wdir'] > 225)
wind[idx]
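
A short sketch of how conditions combine, on a made-up four-row frame (`demo` and its values are assumptions). Each comparison needs its own parentheses because `&`, `|` and `~` bind more tightly than `>`.

```python
import pandas as pd

# Made-up values, chosen so that each row hits a different case
demo = pd.DataFrame({"wspd": [10.0, 36.0, 40.0, 38.0],
                     "wdir": [200.0, 230.0, 100.0, 300.0]})

mask = (demo["wspd"] > 35) & (demo["wdir"] > 225)   # element-wise "and"
print(mask.tolist())                                # [False, True, False, True]

either = demo[(demo["wspd"] > 35) | (demo["wdir"] > 225)]   # element-wise "or"
calm = demo[~(demo["wspd"] > 35)]                           # element-wise "not"
print(either.index.tolist())                        # [1, 2, 3]
print(calm.index.tolist())                          # [0]
```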

Chaining conditions like this can become hard to read. Since pandas 0.13 you can use the query method to make the expression more readable.



In [ ]:

    
# To make it more efficient you should install 'numexpr',
# which is the default engine. If you don't have it installed
# and you don't set the engine to 'python' you may get an ImportError
wind.query('wspd > 35 and wdir > 225', engine = 'python')
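
Inside a query string, a Python variable can be referenced by prefixing it with `@`. The sketch below, on made-up data, checks that the query and the equivalent boolean mask select the same rows; `demo` and `threshold` are hypothetical names.

```python
import pandas as pd

demo = pd.DataFrame({"wspd": [10.0, 36.0, 40.0, 38.0],
                     "wdir": [200.0, 230.0, 100.0, 300.0]})
threshold = 35  # hypothetical cut-off, referenced as @threshold in the query

with_query = demo.query("wspd > @threshold and wdir > 225", engine="python")
with_mask = demo[(demo["wspd"] > threshold) & (demo["wdir"] > 225)]

print(with_query.index.tolist())   # [1, 3]
print(with_mask.index.tolist())    # [1, 3]
```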

These ways of selecting can be ambiguous in some cases. Let's make a parenthetical remark before coming back later to more advanced ways of selecting.

Remark: Alignment of data when we operate with `pandas` data structures

When we perform an operation between two pandas data structures we get a very practical alignment effect. Let's see this by examples:



In [ ]:

    
s1 = pd.Series(np.arange(0,10), index = np.arange(0,10))
s2 = pd.Series(np.arange(10,20), index = np.arange(5,15))
print(s1)
print(s2)

Now, if we perform an operation between both Series, the operation is carried out wherever a label exists in both of them. Labels that exist on only one side are kept in the result, but the operation cannot be performed there, so a NaN is returned instead of an error:



In [ ]:

    
s1 + s2
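
If the NaN values are not wanted, one option is the `add` method with `fill_value`, which treats a label missing on one side as the given value. A minimal sketch reusing the same two Series:

```python
import numpy as np
import pandas as pd

s1 = pd.Series(np.arange(0, 10), index=np.arange(0, 10))
s2 = pd.Series(np.arange(10, 20), index=np.arange(5, 15))

plain = s1 + s2                      # NaN where a label is on one side only
filled = s1.add(s2, fill_value=0)    # a missing side counts as 0

print(plain.isna().sum())    # 10 (labels 0-4 and 10-14)
print(filled.isna().sum())   # 0
print(filled[14])            # 19.0
```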

Coming back to indexing (recap)

One of the basic features of pandas is the labeling of the row and column indexes, which can make indexing more complex than in numpy. We have to distinguish between:

selecting by label
selecting by position (numpy)

Indexing a Series is simpler: since there is only one column, the labels always refer to rows (the index). As we have seen informally, basic indexing on a DataFrame selects columns.

To select a single column, as we have seen previously:



In [ ]:

    
wind['wspd_std']

Or we can select several columns:



In [ ]:

    
wind[['wspd', 'wspd_std']]
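
Note that the type of the result depends on whether a single name or a list of names is passed. A short sketch on a made-up two-column frame (`demo` is an assumption):

```python
import pandas as pd

demo = pd.DataFrame({"wspd": [1.0, 2.0], "wspd_std": [0.1, 0.2]})

one_column = demo["wspd"]        # a single name returns a Series
one_as_frame = demo[["wspd"]]    # a list, even of one name, returns a DataFrame

print(type(one_column).__name__)     # Series
print(type(one_as_frame).__name__)   # DataFrame
```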

But with slicing we access the rows (the index):



In [ ]:

    
wind['2015/01/01 00:00':'2015/01/01 02:00']

So the following will raise an error:



In [ ]:

    
wind['wspd':'wdir']



In [ ]:

    
wind[['wspd':'wdir']]

Uh, what a mess!!

Indexing (à la `pandas`)

We have several available methods to index in a pandas data structure:

loc: used to index with row and column labels (it also accepts boolean arrays).
iloc: based on element positions (as if it were a numpy array).
ix: a combination of both previous methods (deprecated in pandas 0.20 and removed in pandas 1.0).

These methods are also available for Series, but they are less useful there because indexing a Series is not ambiguous.

Let's see how these methods work in a DataFrame...

Select the first three items in columns 'wspd' and 'wspd_max':



In [ ]:

    
wind.loc['2013-09-04 00:00:00':'2013-09-04 00:20:00', 'wspd':'wspd_max']



In [ ]:

    
wind.iloc[0:3, 0:2] # similar to indexing a numpy array: wind.values[0:3, 0:2]



In [ ]:

    
# Note: .ix only exists in pandas < 1.0
wind.ix[0:3, 'wspd':'wspd_max']
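
The `loc` and `iloc` selections can be compared on a synthetic frame, which also shows that `loc` can slice columns by label. The frame `demo` and its values are made up for this sketch.

```python
import numpy as np
import pandas as pd

# Synthetic frame: 6 rows every 10 minutes, 3 columns
idx = pd.date_range("2013-09-04", periods=6, freq="10min")
demo = pd.DataFrame(np.arange(18.0).reshape(6, 3),
                    columns=["wspd", "wspd_max", "wdir"], index=idx)

# loc: label slices, both ends included
by_label = demo.loc["2013-09-04 00:00":"2013-09-04 00:20", "wspd":"wspd_max"]
# iloc: position slices, end excluded
by_position = demo.iloc[0:3, 0:2]

print(by_label.shape, by_position.shape)             # (3, 2) (3, 2)
print(demo.loc[:, "wspd":"wdir"].columns.tolist())   # all rows, columns sliced by label
```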

A fourth way not seen before would be:



In [ ]:

    
wind[0:3][['wspd', 'wspd_max']]



In [ ]:

    
wind[['wspd', 'wspd_max']][0:3]
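
The last two cells chain two indexing operations. That is fine for reading values, but when assigning, a chained expression may act on a temporary copy, so a single `loc` call is the safer choice. A minimal sketch on made-up data (`demo` is hypothetical):

```python
import pandas as pd

idx = pd.date_range("2013-09-04", periods=4, freq="10min")
demo = pd.DataFrame({"wspd": [1.0, 2.0, 3.0, 4.0],
                     "wspd_max": [2.0, 3.0, 4.0, 5.0]}, index=idx)

# One indexer for rows and columns: the assignment reaches the original frame.
# The label slice includes both end points, so the first two rows are set.
demo.loc[idx[0]:idx[1], "wspd"] = 0.0
print(demo["wspd"].tolist())   # [0.0, 0.0, 3.0, 4.0]
```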

Let's practice all of this

Return all the January 2014 values
Compute the mean wind speed during February 2014
Use the query method to obtain all winds coming from the North (within $\pm$ 10º of North, taking North as 0º) and with a wind speed above 10 m/s
The same as before but using a boolean array
All the previous problems can be solved with loc, iloc and/or ix. Practice all the possibilities.



In [ ]:

One last curiosity in case you work with time series

pandas data structures have a method to select between times:



In [ ]:

    
wind.between_time('00:00', '00:30').head(20)



In [ ]:

    
# It also works with Series:
wind['wspd'].between_time('00:00', '00:30').head(20)
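
When the start time is later than the end time, `between_time` wraps around midnight, and by default both end points are included. A sketch on one synthetic day of 10-minute data (`demo` is made up for the example):

```python
import numpy as np
import pandas as pd

# One synthetic day sampled every 10 minutes: 00:00 ... 23:50
idx = pd.date_range("2013-09-04 00:00", periods=144, freq="10min")
demo = pd.Series(np.arange(144.0), index=idx)

# Start later than end: the window wraps around midnight
night = demo.between_time("23:30", "00:30")
print(night.index.strftime("%H:%M").tolist())
# ['00:00', '00:10', '00:20', '00:30', '23:30', '23:40', '23:50']
```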